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Abstract We address the slow convergence and poor stability of quasi-newton se- 
quential quadratic programming (SQP) methods that is observed when solving exper- 
imental design problems, in particular when they are large. Our findings suggest that 
this behavior is due to the fact that these problems often have bad absolute condition 
numbers. To shed light onto the structure of the problem close to the solution, we for- 
mulate a model problem (based on the yl-criterion) , that is defined in terms of a given 
initial design that is to be improved. We prove that the absolute condition number of 
the model problem grows without bounds as the quality of the initial design improves. 
Additionally, we devise a preconditioner that ensures that the condition number will 
instead stay uniformly bounded. Using numerical experiments, we study the effect of 
this reformulation on relevant cases of the general problem, and find that it leads to 
significant improvements in stability and convergence behavior. 

Keywords Sequential Quadratic Programming • Preconditioning • Design of 
Experiments 
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1 Introduction 

One of the most important aspects in model based optimization and investigation of 
real world processes is the estimation of parameters appearing in a model. Typically, 
one estimates these parameters solving a regression problem based on data collected 
from one or more experiments. Optimal experimental design is the task of choosing 
the best experimental setup from a set of possible ones, and according to a predefined 
criterion. As this task is bound to be constrained in non-trivial ways, and is formulated 
in terms of the optimality conditions of an underlying regression problem, it presents 
a rich class of challenging optimization problems. It is also a practically relevant class, 
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as the solution of these problems leads to important improvements in the efficiency of 
research and development [8"ll26| . 

A standard setting is the estimation of parameters using weighted least-squares 
regression. It is then natural to rate the experiments according to the quality of the 
Fisher information matrix, or of its inverse, the variance-covariance matrix. A variety 
of criteria to rate this quality exists, and they are traditionally named after letters of 
the alphabet. In this article we will focus mainly on the so called A- criterion, which is 
defined as the trace of the variance-covariance matrix of the regression. For a full list 
and further discussion we refer to the classic texts |21|[6], 

The experimental design problems we are concerned with here have the following 
form. We consider the nonlinear regression problem 

rn 

p = argminV' \F(p, q,h) - m \ 2 (1) 
p i=l 

where F is a nonlinear function depending on parameters p, controls q, and measure- 
ment points ti. Often these points refer to measurement times, but our scope is not 
restricted to this case. The values /Ltj represent the results of measurements, and F rep- 
resents the model under study. If the measurement errors are independent, normally 
distributed with variance one and zero mean, the parameters obtained from |TJ will 
be a random variable which, to a first approximation, is drawn from a multivariate 
normal distribution centered around the true parameter values p* , and with variance- 
covariance matrix S = (J T J)^ 1 , where the matrix J is given by 

Jij = g^J-F(P*>9>*i)- 

The experiment design problem we consider is the problem of finding controls q, and a 
subset of measurement points such that the trace of the variance-covariance matrix is 
minimal, at least for the parameters p that we believe to be plausible when designing 
the experiment. 

Our basic problem is thus, given a set of feasible controls 0, and a measurement 
budget m max < m, find a set M C {1,2,..., m} with cardinality #M = m ma x, and a 
q G O such that 

TrQj [M] (3) T J [M] (g)]~^ 

is minimal. Here, we have made explicit the dependence on q, and have denoted by 
J[M] the matrix J after deleting all rows whose index is not in the set M. 

There are a couple of approaches in the literature to solve this problem [15lH1fT6l 
6,8 . One important approach, and the one we will consider here, consists in using 
a relaxed formulation to obtain a constrained minimization problem in continuous 
variables, and then to apply a modern optimization method, typically in the form 
of a sequential quadratic programming (SQP) 13,20,19 algorithm to solve it. This 
approach was pioneered by |16lffi[T5]. and the formulation has the following form. 

Define W : R m -> ]R mxm through W(w) := diag(w), a diagonal matrix whose 
entries are the elements of the vector of weights uu. We define the set of admissible 
weights 

m 

f2(m m ax,m) := {w G [0, l] m | ^ Wj = m max } 

i=l 
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Fig. 1 Convergence behavior of an SQP algorithm solving a nonlinear experimental design 
problem. 

The problem now is 

minTr (^J T {q)W{w)J(q)j ^ (2a) 

subject to 

q e 0, we fi(m max ,m). (2b) 

The form of this problem is such that it can be given to an SQP solver as it is. It 
has the important advantage that the optimization of the controls and of the weights 
occurs at the same time. On the other hand, it has the disadvantage of not necessarily 
yielding integer solutions, although practical experience shows that either the solution 
is integer, or only a few weights are not, which can then be remedied by using a 
rounding technique 1 22]. The understanding of this phenomenon is incomplete, but see 
|24j for an important contribution on this issue. Alternatively, the weights in Q can 
be interpreted as a required precision of the measurements, i.e. as a measure of the 
maximum variance of the error which is acceptable. 

In practice, it turns out that solving the problem (I2J) in this way leads to poor 
convergence. Typical behavior can be seen in Figure HTwhere we have plotted the 
length of the search direction obtained from the quadratic problem against the iteration 
number (the precise description of the numerical experiment can be found in section|3|. 
It is important to remark that this difficulty appears also when there are no external 
controls, i.e. when the vector of controls q is empty. 

In what follows, we will attempt to shed some light on the question as to why this 
behavior emerges, and in particular onto how to solve it. To this end, inspired by the 
"test equation" [3] used in the theory of stiff ordinary differential equations, we tackle 
the generality of the setting by developing a model problem. In this model problem we 
discover arbitrarily bad absolute conditioning under fairly generic conditions. It turns 
out that this can be corrected by introducing a simple but nonobvious transformation. 
This transformation can also be applied to yielding a problem for which the SQP 
method converges in much less iterations than for the original one. 

While the connection between absolute condition numbers and the difficulty of solv- 
ing an optimization problem has been made before [271128) . its role in the convergence 
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behavior of SQP methods has not, to our knowledge, been investigated. Thus, while 
we cannot provide a complete theoretical justification for the slow convergence of SQP 
methods when solving (J2J , we provide below what we consider to be strong evidence for 
the hypothesis that it is due to bad absolute condition numbers. At the very least we 
provide, in the form of a left preconditioner, an effective way to significantly accelerate 
the numerical solution of experimental design problems. 



2 The model problem and its conditioning 

We are led to our model problem by removing elements of the full problem Q. The 
first simplification is the removal of the external controls. As a consequence, the matrix 
J is fixed, and we have a problem in a form that is addressed by classical texts (see e.g. 
[6]). Even this simplified problem is difficult to analyze, as the assumptions imposed 
on J in this theory are rather weak. Informally speaking, once it has full rank, J 
can be any old matrix. Since this class of problems is very general, one can expect 
to find counterexamples to any generalization of behavior observed in practice. Our 
approach will thus be to restrict ourselves to a model problem that is endowed with 
enough structure to understand its badly-conditioned nature, and to devise a method 
of curing it. 

This problem is as follows. Given initial information in the form of a variance- 
covariance matrix £ (which we, for simplicity, will consider to be a multiple a of 
the identity matrix) for the parameters of interest, our task is to choose between 
two additional observations using a relaxed formulation. The idea of this problem is to 
model an advanced stage of our experimental design, in which, for the sake of argument, 
all weights except for two are known to be either one or zero. The role of the parameter 
a is to model the quality of the initial design that is the starting point of our model 
problem. The smaller the a, the better the design that is to be improved. 

The optimization problem we will study is thus 



min Tr ( a 1 1 + wiviv'i + W2V2V2 \ ) (3a 

Ul,W2 \ L II 



W± ,U>2 

subject to 

w\ + W2 = 1 and < tOj < 1, i = 1, 2. (3b) 

Theorem 1 Suppose that v[v2 = and \\vi\\ — \\v2W = 1. Then the solution of 
problem Q is w\ = W2 = 1/2, and the absolute condition number of the solution is 

(1+ia) 3 

The use of the absolute condition number is due to the fact that we are searching 
for the zero of a derivative. The error in the "data" is thus a perturbation of the zero, 
which is only meaningful in an absolute sense. 

Proof We write first u>2 = 1 — tui to eliminate the equality constraint, and thus obtain 
a problem in one variable w. The function we want to minimize is then 

f(w) := Tr ( j^a -1 / + wvivi + (1 — w)u2ujj J . (5) 
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The orthogonality of v\ and V2 simplifies the application of the Sherman-Morrison 
formula to obtain 

f( W ):=2a-^ V ° P (1 7 W K ■ (6) 

J K ' l + wa 1 + (1 - w)a v ' 

Now we look for w* such that / (to*) = 0. Straight-forward algebraic arguments yield 
w* = 1/2. 

To investigate the effect of perturbations, we choose e > 0, and define the solution 
mapping g : (— e,e) — > R by /'((/(e)) = e. Of course, g(0) = to*, and the condition 
number of the problem / (to) = is given by [4] 

„ _ 1^(0)1 

abs " ls(o)| ■ 

Using implicit differentiation on / (<?(e)) - e = we obtain the expression 

Kabs= !/"(»*)! M ' (7) 

To finish the proof, we only need to compute f"(w) and verify the expression. 

Theorem[T]suggests that as the optimization progresses and our experimental design 
becomes better and better, that is, a becomes smaller, the absolute condition number 
of the problem will increase. It will behave as a~ 3 for small a. Informally speaking, the 
bottom of the valley in which the solution lies will become very flat, making it harder 
and harder to choose between two additional rows of roughly the same size. The model 
problem thus predicts the stagnation in the solution process, which is precisely what 
often occurs in practice. 

At least for the model problem, there is a surprisingly simple remedy in the form of 
a left preconditioner: A left preconditioner for a given problem is a diffeomorphism in 
the dependent variables that preserves the solution, and at the same time it improves 
the condition number. Its name comes from the fact that it appears to the left of the 
function that defines the unpreconditioned problem. The definition we use here is an 
extension of the definition that is commonly used in linear algebra (see e.g |23| ) to a 
nonlinear setting. The next theorem suggests a possible choice of a left preconditioner 
for the model problem. 

Theorem 2 The problem 

min — < Tr I a -1 / + wiVivf + W2V2V2 

Wl,W2 V L 



subject to 



w l + w 2 — 1 an d < tOj < 1, i = 1, 2. (8b) 



has the same minimum as For every a > the absolute condition number of the 
minimum is «; a bs — 2. 

Proof Repeat, with / := —f~ 2 , the calculation of the condition number of Theorem[l] 
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In light of the above, a possible way to achieve better convergence when solving 
|2| is to take inspiration in Theorem [5] and choose the left preconditioner 

h : (0, oo) ->• (-oo, 0) h{z):= -z~ 2 . 

As a consequence, we propose (and recommend) to solve the following problem instead 
of 

mm- ^v(^J T {q)W{w)J{q)\ * \ J (9a) 
subject to the constraints 

w G f2(m ma x,rn) and q £ 0. (9b) 

As we will see in the next section, this modification has the desired effect of accelerating 
the convergence of SQP methods when solving experimental design problems. 



3 Numerical experiments 

In what follows, we will compare experimentally how well the SQP solver behaves when 
solving the original problem |2]| versus the preconditioned problem j9J. In the first 
experiment, we tackle the task of optimizing a design when given a prior information, 
but without external controls q. In the second experiment, we keep m max constant and 
vary the number of candidate measurements to observe how the reformulation affects 
performance as the size of the problem changes. The third numerical experiment is a 
full nonlinear experimental design problem with external controls on a model defined 
through a system of ordinary differential equations. With it, we intend to illustrate the 
effect of the reformulation on practical problems. 

To make the results representative and reproducible, we programmed the SQP 
solver as it is described in |19| . with an augmented Lagrangian penalty function as 
described in |25j . This results in a reasonably robust and effective solver. The quadratic 
problems are solved using QPOPT [§] , and the Hessian is approximated using damped 
BFGS updates [19 . To avoid scaling issues as much as possible, we use as an initial 
Hessian approximation a diagonal matrix with the absolute values of the diagonal of the 
exact Hessian. This retains the diagonal of the exact Hessian in the unpreconditioned 
(convex) case, and assures positive definiteness in the preconditioned case. 

We developed a Radau Ha solver of order 5 based on the ideas in [12] that uses 
internal numerical differentiation [2] to compute accurate sensitivities of the differential 
equation (here: 1st to 3rd order). All derivatives of algebraic functions, including the 
inversion of the information matrix, were computed using automatic differentiation |1H 
IIP] , with a Common Lisp AD package 5_ extended for higher derivatives. The inversion 
of the information matrix was done by directly computing J T W(w)J and applying a 
Cholesky decomposition. To cope with stability issues, we used quad-double arithmetic 
|14j . This also ensures we can use the full range [0, 1] for the weights |17| . 

The problems without external controls are linear, and thus are defined through 
the matrix J, which we generate randomly. We do this by filling a matrix with random 
entries uniformly distributed in [—1,1], performing a singular value decomposition, 
and substituting its singular values with exponentially decaying ones chosen to obtain 
a condition number of 10 4 . 
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Fig. 2 Effect of prior information on stability and average iteration counts of the unprecon- 
ditioncd and preconditioned problems (subscripts u and p, respectively). See text for details. 



3.1 Effect of prior information 



In this first experiment, we want to observe the behavior of SQP methods on the 
sampling design problem for choosing 20 out of 50 possible measurement points: 



min Tr 



^1 + J T W(w)jj 



(10) 



with and without using the preconditioner. The matrix J G R 50x7 is chosen at random 
as described before, but we additionally normalize all its rows in the euclidean norm. 
We thus obtain a nontrivial problem that is similar to our model, and where we can 
observe the effect of prior information al directly. 

In Figure [2] we summarize the results of solving problem ( |10[ | with and without pre- 
conditioning for 200 different matrices and 11 different values of a, chosen equidistantly 
on a logarithmic scale in [10 -6 , 1]. As a decreases, and thus the prior information be- 
comes better and better, we see that the iteration count of the preconditioned variant 
stabilizes to around 40, while the sensitivity to the initial guess is about the same for 
each a. This distance was estimated by comparing the solution we obtained using two 
different starting guesses for the iteration. One that has the first 20 components equal 
to 1, and the rest is zero, and one starting guess that is the reverse. Since the problem is 
convex, we can use this sensitivity as a proxy for the distance to the exact solution. This 
value tends to be a lot larger for the unpreconditioned problem, which also becomes 
very hard to solve for small a. On the right of Figure[2]we have plotted the percentage 
of problems that could not be solved because the quadratic solver reached its default 
iteration limit, something that never occurred with the preconditioned formulation. 



3.2 Effect of problem size 

Now we consider matrices J of size 50n x 7 for n = 1,2, ...,10. For each size we 
generated 200 matrices, and solved the problem 



mm 

uGfi(20,50r 



^Tr Qj T W(w)jj ^ (11) 
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Fig. 3 Average iterations of the SQP method solving the unpreconditioned and preconditioned 
problem (subscripts u and p, respectively). 



with and without preconditioner. We plot average iteration counts in Figure [3] with 
error bars giving the standard deviations. Again we observe a significant improvement, 
on average, of the iteration counts for each problem. We also observe that, for the 
unpreconditionend problem, the iteration count increases with the problem size n, 
which is likely due to the fact that the value of the minimum becomes smaller with 
matrix size, so that the effect predicted by Theorem [I] increases. Whether the iteration 
counts increase or not for the preconditioned variant is not possible to tell conclusively 
from this experiment, but we conjecture that it does. 



3.3 Full nonlinear problem 

Our original motivation was to find an advantageous formulation for full nonlinear 
experimental design as it is relevant for applications in engineering. Thus we consider 
in our next experiment an experimental design problem defined on a system of nonlinear 
differential equations. For simplicity, we choose the FitzHugh-Nagumo [71118] model, 
which is given by the system 

x\ = x\ — zx\ — x 2 + I xi(to) = XQ t x, (12) 

±2 = a ■ (xi + b + cx 2 ) x 2 (t ) = x Q>2 . (13) 

The task is to find a experimental design to estimate the four fixed parameters z — 0.25, 
a = 0.02, b — 0.7, and c = —0.8 using the values of /, 2:0,1, and 10,2 as controls, which 
are expected to satisfy the constraints 

- 5 < x s < 5, -5 < x ,2 < 5, -1 < I < 0.5. (14) 

At most 30 measurements should be taken. A measurement is the reading of the value 
of any of the two variables at any of the times tj = 5i, i = 1, 2, . . . , 100. 

The initial guess for the weights is that all of them are equal. We choose the initial 
guess for the optimal controls randomly to be able to assess typical behavior as much 
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score 


(k P ) 


(feu) 


(J^u J fcp) 


(7 


5:0 


46.0 


260.8 


3.9 


1.7 



Table 1 Performance statistics for the FitzHugh-Nagumo example. The values of k denote 
iteration counts, the subscripts u and p indicate that they refer to unpreconditioned and 
preconditioned variants. The angled brackets indicate average, and a stands for the standard 
deviation of the speed-up factor (k u /k p ) 
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Fig. 4 Optimal design for the FitzHugh-Nagumo system 

as possible. Since it is possible to choose the controls in such a way that the least- 
squares problem is singular, we only use those for which Tr([J T (qo)^(9o)] _1 ) < 100. 
The optimal values are usually three to four orders of magnitude lower than this upper 
limit. 

We repeated the experiment five times, and summarize the results in table [l] We 
note again that the preconditioned formulation leads to important gains in convergence 
speed. Here we have that the problem is not convex for either formulation, and one 
observes that there exist many local minima. For a given starting value, the two variants 
may or may not converge to the same one. In Figure [l] we can observe the convergence 
behavior of a typical run in terms of the length of the search direction. For completeness, 
we include in Figure [4] a solution of this optimization problem (/ = 0.37, £0,1 = 5, 
£0,2 = 3.76), together with the optimal measurement points. 



4 Final remarks and outlook 

In this article, we have studied the convergence problems of SQP methods when solv- 
ing experimental design problems. Our findings suggest that bad absolute condition 
numbers are indeed the cause of slow convergence. 

We work around the generality of the problem by studying a carefully chosen 
model problem involving the yl-criterion. From the understanding gained about the 
conditioning of this model problem we derive a transformation that guarantees constant 
absolute condition numbers in this particular case. 

We then provide strong experimental evidence that this transformation yields sig- 
nificant improvements in the convergence behavior of SQP methods when applied to 
relevant settings, where we observe that the improvement is more significant for larger 
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problems. The transformation itself does not introduce any noticeable additional com- 
putational cost. 
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